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Abstract 


Gene transfer agents (GTAs) are virus-like particles encoded and produced by many bacteria and archaea. Unlike viruses, 
GTAs package fragments of the host genome instead of the genes that encode the components of the GTA itself. As a result 
of this non-specific DNA packaging, GTAs can transfer genes within bacterial and archaeal communities. GTAs clearly 
evolved from viruses and are thought to have been maintained in prokaryotic genomes due to the advantages associated 
with their DNA transfer capacity. The most-studied GTA is produced by the alphaproteobacterium Rhodobacter capsulatus 
(RcGTA), which packages random portions of the host genome at a lower DNA density than usually observed in tailed bacte- 
rial viruses. How the DNA packaging properties of RCGTA evolved from those of the ancestral virus remains unknown. To 
address this question, we reconstructed the evolutionary history of the large subunit of the terminase (TerL), a highly con- 
served enzyme used by viruses and GTAs to package DNA. We found that RcGTA-like TerLs grouped within viruses that em- 
ploy the headful packaging strategy. Because distinct mechanisms of viral DNA packaging correspond to differences in the 
TerL amino acid sequence, our finding suggests that RcGTA evolved from a headful packaging virus. Headful packaging is 
the least sequence-specific mode of DNA packaging, which would facilitate the switch from packaging of the viral genome 


to packaging random pieces of the host genome during GTA evolution. 
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1. Introduction 


Gene transfer agents (GTAs) are virus-like particles encoded 
and produced by certain bacteria and archaea (reviewed most 
recently by Lang, Westbye, and Beatty (2017) and Grull, 
Mulligan, and Lang (2018)). Unlike viruses, GTAs package frag- 
ments of the host genome instead of the genes that encode the 
GTA itself (Hynes et al. 2012). When GTA particles infect another 
cell, they can transfer the encapsidated genetic material to the 
recipient (Solioz, Yen, and Marris 1975; McDaniel et al. 2010; 
Hynes et al. 2012). The genomic loci that encode GTAs resemble 


prophages, indicating that GTAs evolved from viral ancestors. 
Although the function of GTAs is not firmly established, the pre- 
vailing hypothesis is that GTAs are not defective prophages, but 
instead are agents of horizontal gene transfer that are main- 
tained in prokaryotic genomes due to the advantages associated 
with gene exchange, particularly, in stressful conditions (Lang, 
Westbye, and Beatty 2017; Kogay et al. 2020). 

The best-studied GTA is produced by the alphaproteobacte- 
rium Rhodobacter capsulatus, and will be referred to as RcGTA. 
Production of the RcGTA particles is triggered by environmental 
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factors (Westbye et al. 2017a), occurs in a small fraction of the 
population (Fogg, Westbye, and Beatty 2012; Hynes et al. 2012), 
is regulated by host proteins (Westbye, Beatty, and Lang 2017b; 
Ding et al. 2019; Fogg 2019) and involves expression of genes 
that are found in at least five loci in the R. capsulatus genome 
(Hynes et al. 2016; Lang, Westbye, and Beatty 2017). The largest 
of these loci is a 17-gene ‘head-tail cluster’ that encodes pro- 
teins involved in head-tail morphogenesis and DNA packaging 
(Lang, Westbye, and Beatty 2017). There are homologs of the 
RcGTA head-tail cluster genes in other alphaproteobacteria 
(Shakya, Soucy, and Zhaxybayeva 2017), including several 
Rhodobacterales for which GTA production has been observed (Fu 
et al. 2010; Nagao et al. 2015). Additionally, homologs of head- 
tail cluster genes are present in numerous viruses and provi- 
ruses (Shakya, Soucy, and Zhaxybayeva 2017). 

The small size of RCGTA and the low density of its packaged 
DNA precludes the particle from accommodating all of the 
genes required for its production (Bardy et al. 2020). Instead, 
RcGTA packages seemingly random portions of the host ge- 
nome (Hynes et al. 2012). In double-stranded DNA viruses of the 
realm Duplodnaviria, genome packaging is mediated by the ter- 
minase and portal proteins (Fokine and Rossmann 2014; Rao 
and Feiss 2015). Viral terminases typically consist of large and 
small subunits (Rao and Feiss 2008). The small subunit (TerS) 
binds to the DNA to be packaged and then recruits the large sub- 
unit (Casjens 2011; Rao and Feiss 2015). The large subunit (TerL), 
which consists of ATPase and nuclease domains, cuts the con- 
catemeric viral DNA, translocates the DNA into the viral capsid 
with concomitant ATPase hydrolysis and, finally, cuts the DNA 
again to terminate packaging (Casjens 2011; Rao and Feiss 2015). 
Viruses evolved different strategies for packaging DNA into 
their capsids, and these strategies involve different classes of 
TerL (Casjens and Gilcrease 2009). Some viruses employ a head- 
ful packaging strategy whereby TerL initially cuts the viral con- 
catemeric DNA at a specific sequence (pac site) and terminates 
packaging when the capsid is full, rather than at a particular se- 
quence (Rao and Feiss 2008; Casjens and Gilcrease 2009). The 
virions produced by headful phages usually package more than 
100 per cent of the phage genome length, and as a result, have 
terminally redundant, circularly permuted chromosomes 
(Casjens and Gilcrease 2009). 

Viral TerLs with the same packaging mechanism tend to 
form clades in phylogenetic trees (Casjens et al. 2005). However, 
due to the diversity of TerL sequences, the relationships among 
the different functional classes of TerLs are not well-resolved 
(Casjens et al. 2005; Casjens and Gilcrease 2009; Merrill et al. 
2016). Given its lack of sequence specificity, RCGTA is presumed 
to package fragments of the host DNA via the headful strategy 
(Casjens et al. 2005; Hynes et al. 2012). The RcGTA TerL and its 
alphaproteobacterial homologs formed a distinct group in previ- 
ous phylogenies, but they did not cluster with or within the viral 
TerLs that are involved in headful packaging (Casjens et al. 
2005; Casjens and Gilcrease 2009). Therefore, these phylogenies 
did not provide evidence of a headful packaging strategy in 
alphaproteobacterial GTAs, in part, due to limited sequence 
data available at the time. 

In this study, we conducted a comprehensive evolutionary 
analysis of TerL sequences to better resolve the phylogenetic re- 
lationship of alphaproteobacterial RcGTA-like TerLs and viral 
TerLs with known packaging strategies. Our analyses suggest 
that RcGTA, and, by inference, the rest of the putative alphapro- 
teobacterial GTAs, evolved from a virus that employed the 
headful packaging strategy. We also identified two amino acid 
substitutions that are conserved in the TerLs of the putative 


alphaproteobacterial GTAs and might be important for the DNA 
packaging properties of GTAs. 


2. Methods 


2.1 Retrieval and sequence-based clustering of large ter- 
minase homologs 


Eighteen profiles covering one or both (ATPase and nuclease) 
domains of TerL were retrieved from the NCBI Conserved 
Domains Database (CDD) (Lu et al. 2020) (accessed on 11 
December 2018). Two TerL profiles for distinct families of bacte- 
rial and archaeal viruses were added from Philosof et al. (2017) 
and Yutin et al. (2018). These 20 profiles (Supplementary Table 
S1) were used as queries for PSI-BLAST searches (E-value 
threshold of 0.01, effective database size of 2 x 10’ sequences) 
(Altschul et al. 1997) against the NCBI non-redundant protein 
database (accessed in December 2018). Only the subject sequen- 
ces that were taxonomically assigned to archaea, bacteria, and 
viruses were retained. 

Partial TerL sequences were removed by ensuring the pres- 
ence of both an ATPase (N-terminal) and a nuclease (C-termi- 
nal) domain using the following criteria: sequences either had 
to align to >75 per cent of a ‘full’ TerL profile that includes both 
TerL domains or align to different TerL profiles over their N- 
and C-terminal domains. A sequence was considered to align 
over a specific domain if it met one of two conditions: 1, if the 
sequence matched >75 per cent of an N- or C-terminal domain- 
specific profile and had at least 35 per cent of the protein length 
outside of the matched domain to contain the unmatched do- 
main or 2, if a sequence aligned to just the N- or C-terminal por- 
tion of a ‘full’ TerL profile and had at least 35 per cent of the 
protein length outside of the matched domain to contain the 
unmatched domain. 

The resulting 254,382 sequences were clustered using 
MMsegs2 (Steinegger and Soding 2017) with a similarity thresh- 
old of 0.75. From each of the obtained 11,298 clusters, a repre- 
sentative sequence of median length was selected for 
subsequent analyses. 


2.2 Alignment of representative homologs and filtering 
out partial sequences 


The representative TerL sequences were iteratively aligned and 
clustered using the approach described by Wolf et al. (2018). 
Briefly, the sequences were clustered with a similarity threshold 
of 0.5 using UCLUST (Edgar 2010). The clustered sequences were 
aligned using MUSCLE (Edgar 2004) and alignment sites that 
contained more than 67 per cent gaps were temporarily re- 
moved. Pairwise similarity scores between cluster alignments 
were calculated using HHSEARCH (Sodding 2005), converted to a 
distance matrix and used to build a UPGMA tree (Sokal and 
Michener 1958). All of the branches of the UPGMA tree above a 
depth threshold of 2.3 were used to guide progressive alignment 
of the clusters using HHALIGN (Soding 2005). The removed sites 
were reinserted back into their original sequences after the pro- 
file-profile alignment. These alignment and clustering steps 
were repeated for 20 iterations, when 11,230 of the sequences 
formed one alignment. 

Of the remaining 68 sequences that failed to align, two were 
clearly TerLs but contained inteins, which were manually re- 
moved. Five other sequences were also likely TerLs because 
they were longer than 300 amino acids, exhibited significant 
similarity to a TerL profile via CDD searches (E-value <0.001), 
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and contained recognizable Walker A motif and nuclease cata- 
lytic residues. The seven sequences were profile-aligned to the 
alignment of 11,230 sequences using more relaxed criteria (sim- 
ilarity threshold of 0.01 and UPGMA depth threshold of 6). The 
remaining 61 sequences did not meet these criteria and were 
discarded. 

The new alignment of 11,237 sequences contained partial 
sequences that lacked a Walker A motif. To remove these, each 
sequence’s similarity was scored to the alignment’s consensus 
sequence using a BLOSUM62 substitution matrix and the score 
was compared to the score of sequences with 100 per cent iden- 
tity to the consensus sequence. Sequences with a score less 
than 10 per cent of the perfect match score were removed. 
Then, the alignment was used as a PSSM in a PSI-BLAST search 
against all of the sequences within the alignment. Only the 
sequences that matched to >75 per cent of the PSSM were 
retained. The sections of the 11,060 sequences that passed this 
criterion were extracted and re-aligned using the above- 
described iterative alignment procedure with a clustering simi- 
larity threshold of 0.5 and UPGMA depth threshold of 2.3. After 
23 iterations of alignment and clustering, 11,057 of the sequen- 
ces aligned. The three sequences that did not align were longer 
than 300 amino acids, exhibited significant similarity to a TerL 
profile via CDD searches (E-value <0.001), and contained a rec- 
ognizable Walker A motif and nuclease catalytic residues. 
Therefore, they were retained and aligned to the main align- 
ment using more relaxed parameters of a clustering similarity 
threshold of 0.01 and UPGMA depth threshold of 6. 


2.3 Alignment trimming 


The alignment of 11,060 sequences was trimmed to remove all 
columns with more than 50 per cent gaps and less than 10 per 
cent amino acid homogeneity. The homogeneity value of an 
alignment column was defined and calculated using the follow- 
ing procedure. For each of the N=11,060 sequences, column- 


N 
based sequence weights w; (z Ww; = ) were assigned accord- 
i=t 


ing to Henikoff and Henikoff (1994). The score of an alignment 


column against an amino acid x was calculated as 


Q = > W;Sq,x, Where aj is an amino acid in the i-th sequence 
i=1 

and Sg, is the BLOSUM62 substitution matrix score for a pair of 
amino acids a; and x (Henikoff and Henikoff 1993). As the con- 
sensus amino acid of the column c, the amino acid with the 
highest score Q,, that is, c = argmax,, Qx, was selected. An expec- 
tation of the score of the given alignment column against a ran- 
domly selected amino acid R was calculated as Qk = >>, fyQr, 
where f, is the vector of relative frequencies of amino acids 
(oO, fo = 1, b € {Ala, Tyr}). The homogeneity of an alignment col- 
umn was defined as H= max (2=% 0). The homogeneity 
ranges from 0 (the alignment column score does not exceed the 
random expectation score Qa) to 1 (the alignment column score 
is equal to the maximum possible score S¢,). 





2.4 Reconstruction of large terminase phylogeny 


The trimmed alignment of 11,060 sequences was used to recon- 
struct an initial phylogenetic tree in FastTree v. 2.1.4 (Price, 
Dehal, and Arkin 2010) using the Whelan and Goldman (WAG) 
substitution model (Whelan and Goldman 2001) and 20 gamma- 
distributed rate categories (Yang 1994). The initial tree was used 
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as a guide tree to refine our alignment, which in turn was used 
to reconstruct an improved tree. To this end, the sequences 
were divided into two sets, those that formed distinct groups on 
the tree and the remaining ones. The sequences that formed 
distinct groups were re-aligned using a clustering similarity 
threshold of 0.01 and UPGMA depth threshold of 2.3, whereas 
the other sequences were aligned using more stringent parame- 
ters (similarity threshold of 0.66 and UPGMA depth threshold of 
1.3). These alignments were profile-aligned using a similarity 
threshold of 0.5 and UPGMA depth threshold of 2.3. Nine 
sequences did not join the main alignment and were discarded 
due to low scores against the consensus, calculated with a 
BLOSUM62 matrix as described above. The resulting alignment 
of 11,051 sequences was trimmed to remove all columns with 
more than 50 per cent gaps and less than 10 per cent amino acid 
homogeneity, as defined above. The final phylogenetic tree of 
11,051 TerL homologs (Fig. 1) was reconstructed using FastTree 
v. 2.1.4 (Price, Dehal, and Arkin 2010) with the WAG substitution 
model (Whelan and Goldman 2001) and 20 gamma-distributed 
rate categories (Yang 1994). 


2.5 Identification of RCGTA-like large terminases 


In the phylogenetic tree of TerLs (Fig. 1), a group of 616 TerLs was 
labeled as the ‘RcGTA-like containing group’. This group includes 
507 TerLs of the 526 RcGTA-like TerLs that were identified and 
curated by Kogay et al. (2019). Nineteen RcGTA-like TerLs from 
the dataset of Kogay et al. (2019) are absent in our dataset be- 
cause they were not in GenBank at the time of our data collection 
(December 2018). The 616 TerLs were classified as either ‘RcGTA- 
like’ or ‘virus-like’ using a machine learning approach imple- 
mented in the GTA-Hunter program (Kogay et al. 2019). 


2.6 Examination of the genomic neighborhoods of large 
terminases 


For the 604 TerLs within the ‘RcGTA-like containing’ group 
(Fig. 1) that originated from bacterial and archaeal genomes, the 
presence of 11 other RcGTA-like genes near the terL gene was 
examined. To this end, the RcGTA-like genes from the training 
set of Kogay et al. (2019) were used as queries in a BlastP search 
against the assemblies of the bacterial and archaeal genomes 
(E-value <0.001; query and subject overlap by at least 60 per 
cent of their length) (Altschul et al. 1997). The detected RcGTA 
gene homologs were classified as ‘RcGTA-like’ or ‘virus-like’ us- 
ing GTA-Hunter (Kogay et al. 2019). The detected RcGTA gene 
homologs were also clustered into regions using DBSCAN, with 
a maximum distance cutoff of 8,000 bp between adjacent genes 
(Ester et al. 1996; Kogay et al. 2019). If a terL gene was embedded 
in a region containing at least 6 of the 11 RcGTA-like genes, it 
was Classified as being in a ‘large RcGTA-like element’. If a terL 
gene was located in a region containing 1 to 5 RcGTA-like genes, 
it was classified as being in a ‘small RcGTA-like element’. The 
classification of the sequence represented in the phylogeny was 
assumed to be the same for the rest of the cluster members al- 
though this was not directly verified. 


2.7 Assignment of packaging strategy to viral large 
terminases 


A list of viruses with experimentally determined packaging 
mechanisms was compiled from the phylogenetic tree of Casjens 
and Gilcrease (2009). Viruses from the phylogenetic tree of Merrill 
et al. (2016) were also added, for most of which experimental evi- 
dence of the packaging mechanism is available. The dataset of 
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Figure 1. Phylogeny of TerLs from 11,051 viruses, prophages, and GTAs. The TerL protein from RcGTA is denoted by a pink bar. The pink bracket outlines a subtree that 
contains RcGTA-like TerLs and is shown in detail in Fig. 2. The green bars denote the viruses that have experimentally determined packaging strategies 
(Supplementary Table S2). The viruses with experimentally determined packaging strategies are labeled by the packaging strategy (Headful, Cohesive ends [COS], 
Direct Terminal Repeats [DTR], and Host Ends) followed by a prototype phage from that group (e.g., P22). Support values (aLRT) of >75 per cent denoted by red dots are 
only shown for a selection of branches relevant to grouping RcGTA-like TerLs within phages that employ a headful DNA packaging strategy. Scale bar, amino acid sub- 
stitutions per site. Tree topology with all support values is available in NEWICK format in Supplementary Data. The patterns of this phylogeny are consistent with 
those in the phylogenies reconstructed using a more accurate maximum likelihood inference carried out using the IQ-TREE program (see Supplementary Dataset); in 
particular, the RCGTA-like TerLs continue to form a well-supported group (100 per cent of bootstrap samples), which forms a sister group to TerLs of phages that utilize 


a headful packaging strategy (75 per cent of bootstrap samples). 


the 252,614 TerL homologs was searched for these 87 viruses us- 
ing TerL accession numbers provided by Merrill et al. (2016) and 
NCBI taxonomy IDs for the viruses from Casjens and Gilcrease 
(2009). Of the 87 viruses, 73 were present in our dataset 
(Supplementary Table S2). Due to the close sequence similarity, 
some of the 73 TerLs belong to the same MMSEQ clusters, and 
therefore are represented by 58 TerLs on our phylogeny of 11,051 
TerLs (Fig. 1). The 58 representative viruses for which the packag- 
ing mechanism was not known were assigned the mechanism of 


a virus from the same cluster with a known packaging strategy, 
under the assumption that the similarity of their TerL amino acid 
sequences is sufficient to imply the same packaging mechanism. 


2.8 Validation of the reconstructed phylogenetic patterns 
with more accurate maximum likelihood analyses 


To confirm that the phylogenetic relationships obtained from 
the FastTree program (Price, Dehal, and Arkin 2010) were not 
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impacted by its limited tree search and optimization capabili- 
ties, additional phylogenetic trees were reconstructed using IQ- 
TREE v 1.6.7 (Nguyen et al. 2015) from two datasets subsampled 
from the 11,051 TerLs. The first dataset of 342 TerLs was con- 
structed to broadly represent the TerL diversity (Fig. 1). 
The dataset contains the 58 representative viral TerLs (de- 
scribed in Section 2.7), 50 TerLs randomly sampled from group 1 
of the ‘RcGTA-like containing group’ (Fig. 2), all TerLs from 
group 2, 70 TerLs from group 3, 50 TerLs randomly sampled 
from the rest of the ‘RcGTA-like containing group’, and 100 ran- 
domly sampled TerLs from the rest of the whole TerL tree 
(Fig. 1). The second dataset of 346 TerLs was constructed to rep- 
resent well the TerLs from the ‘RcGTA-like containing group’ 
(Figs 1 and 2). The dataset contains 12 representative viral TerLs 
with either P22 or Sf6-like headful packaging strategies, 70 
TerLs randomly sampled from group 1, all TerLs from group 2, 
and 250 TerLs randomly sampled from group 3. For both data- 
sets, the aligned sequences were retrieved from the trimmed 
alignment of 11,051 TerLs (see Section 2.4) and gap-only sites 
were removed. The optimal evolutionary model was selected 
using ModelFinder (Kalyaanamoorthy et al. 2017), as imple- 
mented in IQ-TREE. Support values for branches were calculated 
using ultrafast bootstrap approximation with 1,000 replicates 
(Hoang et al. 2018), as implemented in IQ-TREE. The tree 
reconstructed from the second dataset was rooted with the 
headful P22 viral TerL that, out of the headful P22 viral TerLs in 
Fig. 1, is the most distantly related to the ‘RcGTA-like containing 


group’. 


2.9 Annotation of prophages in regions encoding virus- 
like TerLs 


To examine whether the bacterial TerLs classified as virus-like 
reside in prophages, prophages were predicted in the corre- 
sponding nucleotide genome sequences using PHASTER (Arndt 
et al. 2016, accessed in May 2020). The regions labeled as intact 
(score >90) or questionable (score 70-90) prophages were 
retained, while the regions labeled as incomplete prophages 
(score <70) were discarded. One region surrounding the virus- 
like TerL (accession WP_020474221) was predicted by PHASTER 
to be an intact prophage and was also classified as a small 
RcGTA-like element by GTA-Hunter, due to the presence of one 
RcGTA-like gene. The PHASTER prediction of this region as a 
prophage was considered to supersede the small RcGTA-like el- 
ement classification. 


2.10 Detection of conserved sites that differentiate 
RcGTA-like and virus-like TerLs 


The amino acid sequences of the 616 TerLs in the subtree 
shown on Fig. 2 were re-aligned using MAFFT-linsi v. 7.305 
(Katoh and Standley 2013). The alignment was scanned for sites 
conserved in more than 80 per cent of the TerL homologs from 
group 1 (virus-like) or group 3 (RcGTA-like) (Fig. 2). Of these 
detected sites, only the sites with at least 70 per cent between- 
group difference in the relative abundance of the most con- 
served amino acid were retained. 

The positions of the two identified sites relative to the 
known TerL structural domains were determined by searching 
CDD (Lu et al. 2020) with RcGTA TerL RefSeq record 
WP_031321187 as a query (database accessed on 26 July 2020). 
The conservation of the two detected sites within the ‘RcGTA- 
like containing group’ was visualized using a subset of the 616 
TerLs (Fig. 2): 15 randomly sampled TerLs from group 1, all 
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Figure 2. A subtree of the phylogeny shown on Fig. 1 that contains RcGTA-like 
TerLs. Bars in the first column next to the branches of the subtree indicate 
whether the TerLs are from true viruses or, if they are found in bacterial or ar- 
chaeal genomes, whether they were classified by GTA-Hunter as ‘RcGTA-like’ or 
‘Virus-like’. Bars in the second column denote whether the genomic neighbor- 
hood of the terL gene contains at least six RCGTA-like genes (‘large RcGTA-like 
element’), between one and five RcGTA-like genes (‘small RcGTA-like element’), 
a questionable prophage or an intact prophage. TerLs without a colored bar in 
the second column were not predicted as being in a prophage or RcGTA-like ele- 
ment. Group 1 includes mostly virus-like TerLs as well as TerLs from predicted 
prophages and true viruses. Group 2 contains a mixture of TerLs from a large el- 
ement, small RcGTA-like elements, and predicted prophages. Group 3 mostly 
contains RcGTA-like TerLs from ‘true’ GTAs (large RcGTA-like elements). Red 
dots indicate two nodes that are relevant for the grouping of the entire subtree 
and for the TerLs in group 3, and have aLRT support values of >75 per cent. The 
patterns of this phylogeny are consistent with those in the phylogeny recon- 
structed using IQ-TREE (see Supplementary Data); in particular, the node corre- 
sponding to group 3 has 96 per cent bootstrap support. Scale bar, amino acid 
substitutions per site. 
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TerLs from group 2, 14 randomly sampled TerLs from group 3, 
and a representative TerL from the cluster that contains the 
RcGTA TerL. The aligned TerLs were retrieved from the 
alignment of 616 TerLs, and the gap-only alignment positions 
were removed. Secondary structure information was obtained 
via HHPred using RcGTA TerL as a query (Zimmermann et al. 
2018). 

The locations of the two sites were also visualized on a 3D 
structure of TerL from the Shigella phage Sf6 (PDB ID 4IDH) (Zhao 
et al. 2013) which, based on our phylogenetic inference, is the 
TerL most closely related to RCGTA-like TerLs for which a struc- 
ture is available. The homologous positions of the substitutions 
in the Shigella phage Sf6 TerL were identified by aligning it to 
the RcGTA TerL using HHPred (Zimmermann et al. 2018). The vi- 
sualization was carried out in PyMOL v 2.4 (Schrodinger, LLC 
2020). 


3. Results 


3.1 RcGTA-like TerLs belong within the group of TerLs of 
headful packaging phages 


Of the 11,051 representative TerL homologs from bacteria, ar- 
chaea, and viruses, 616 are closely related to the RcGTA TerL 
and form a well-supported group in the phylogenetic tree 
(Fig. 1). Of these 616 TerLs, twelve are encoded in viral genomes, 
whereas the remaining 604 are found in 601 bacterial and 3 ar- 
chaeal genomes. Using a machine learning approach that relies 
on amino acid composition, we classified the 604 bacterial and 
archaeal TerL homologs as either ‘RcGTA-like’ (527) or ‘virus- 
like’ (77) (Supplementary Table S3). By mapping 73 TerLs with 
experimentally determined packaging strategies onto the phy- 
logeny, we found that RcGTA-like TerLs fall, with strong sup- 
port, within a group of headful packaging phages (Fig. 1). 
Therefore, our phylogeny implies that the RcGTA-like TerLs 
evolved from a viral TerL that employed a headful packaging 
strategy and is thus consistent with the earlier proposed hy- 
pothesis that RcGTA-like TerLs use a headful mechanism to 
package host DNA (Casjens et al. 2005; Hynes et al. 2012). Of the 
TerLs from the viruses with experimentally determined packag- 
ing strategies, Enterobacteria phage P22-like TerLs are the closest 
relatives of the RcGTA-like TerLs. This affinity contrasts the pre- 
vious results, from analyses of much smaller data sets, accord- 
ing to which T4-like (Lang and Beatty 2000; Hynes et al. 2012) or 
T7-like (Casjens et al. 2005) TerLs were found to be most closely 
related to the RcGTA-like TerLs. 


3.2 Phylogenetic evidence of a single origin of GTA TerLs 


Whereas the TerLs of viruses that employ headful DNA packag- 
ing specifically package the viral genome into the capsid, 
RcGTA TerL lacks sequence specificity and packages random 
segments of the bacterial genome (Hynes et al. 2012). To evalu- 
ate if non-specific DNA packaging evolved once or multiple 
times, we first sought to determine more accurately which of 
the RcGTA-like TerLs likely belong to bona fide GTAs. Shakya, 
Soucy, and Zhaxybayeva (2017) hypothesized that genomic 
regions with a smaller number of recognizable RcGTA gene 
homologs are more likely to be prophages than GTAs because 
these regions tend to have a more virus-like GC content relative 
to their host, evolve faster and are more often associated with 
viral genes. Therefore, some of the TerLs classified as ‘RcGTA- 
like’ might not belong to RcGTA-like elements in cases when 
the alphaproteobacterial genomes that contain these genes lack 


homologs of other RcGTA genes. Among the 527 RcGTA-like 
TerLs, we classified 391 as ‘large’ (containing at least six RcGTA- 
like genes near the terL gene) and 136 as ‘small’ (1-5 RcGTA-like 
genes) elements (Fig. 2 and Supplementary Table S3). 

Within the subtree that contains RcGTA-like TerLs (Fig. 2), 
all but one of the TerLs found in large elements form a well- 
supported clade (group 3 in Fig. 2). The one TerL from a large el- 
ement that falls outside this clade is a representative of a clus- 
ter of three TerLs that are found in the genomes of 
alphaproteobacteria Zavarzinia compransoris DSM _ 1231, 
Zavarzinia sp. HR-AS and Oleomonas sp. K1W22B-8. This TerL 
belongs to a narrow ‘transition zone’ (group 2 in Fig. 2) between 
the group 3 TerLs and the deepest branches of the subtree that 
include exclusively viral and ‘virus-like’ sequences (group 1 in 
Fig. 2). The transition zone also contains a mix of RcGTA-like 
TerLs from small elements and virus-like TerLs, including the 
TerL from the intact prophage predicted in the genome of a 
planctomycete Zavarzinella formosa DSM 19928. None of the 
TerLs within this transition zone come from functionally char- 
acterized viruses or GTAs. Thus, the phylogeny indicates that 
RcGTA-like TerLs likely evolved only once from a viral TerL, in 
an ancestor of group 3, by acquiring the capability to package 
DNA non-specifically. The positions of the TerLs from 
Zavarzinia’s and Oleomonas’ putative GTAs and the Zavarzinella 
prophage could be explained by horizontal gene transfer, as pre- 
viously documented in some instances for other RcGTA-like 
genes (Yang et al. 2017) and discussed in detail below. 


3.3 Viruses might mediate horizontal gene exchange of 
RcGTA-like genes 


In addition to the above-discussed predicted prophage from 
Zavarzinella formosa DSM 19928, we identified two intact pro- 
phages in the genomes of the firmicute Thermoactinomyces sp. 
DSM 45892 and the alphaproteobacterium Methylobacterium ter- 
rae 17Sr1-28, and 16 viruses that encode TerLs that are phyloge- 
netically most closely related to the RcGTA-like TerLs (Tables 1 
and 2). Notably, the TerLs of these three prophages are even 
more closely related phylogenetically to the TerLs from large 
elements than the 16 viruses are (Fig. 2), but whether they pro- 
duce functional virions is unknown. 

Some of the 16 viruses encoding TerLs related to GTA TerLs 
infect GTA-containing alphaproteobacteria and possess genes 
that are more closely related to GTA genes than to their homo- 
logs in other viruses. For example, Dinoroseobacter phage 
vB_DshS-RSC contains homologs of four putative tail genes of 
the RcGTA (genes g12-g15; Yang et al. 2017). The predicted pro- 
phage in Zavarzinella formosa DSM 19928 encompasses a homo- 
log of the adaptor gene (g6) that is RcGTA-like in amino acid 
composition. These observations suggest an ongoing exchange 
and recombination of RcGTA-like genes between viruses and 
alphaproteobacteria, which likely explains the presence of 
virus-like TerLs within groups 2 and 3 (Fig. 2). The TerL phylog- 
eny also indicates that such gene exchange might extend be- 
yond alphaproteobacteria because at least four bacterial TerLs 
within groups 2 and 3 come from non-alphaproteobacterial 
genomes (OYV96073, WP_020474221, WP_110156686, and 
OQX66442). Because the viruses with known hosts infect a wide 
range of bacteria that live in environments similar to those oc- 
cupied by GTA-containing alphaproteobacteria (Table 1), they 
either might have an opportunity for gene exchange with vi- 
ruses that infect GTA-containing bacteria or might be capable of 
infecting GTA-containing bacteria, in addition to their currently 
known hosts. 
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Table 1. Sixteen viruses with TerLs most closely related to RCGTA-like TerLs. 
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Virus TerL On tree?* Host name Host taxonomic class Habitat GTA?> Reference‘ 
accession 
number 
Arthrobacter phage Tank ALY10550.1 Yes Arthrobacter sp. Actinobacteria Soil No GenBank? 
ATCC 21022 
Arthrobacter phage Wilde ALY10802.1 No Arthrobacter sp. Actinobacteria Soil No GenBank 
ATCC 21022 
Caulobacter phage Sansa AKU43425.1 Yes Caulobacter Alphaproteobacteria Aquatic Yes 26450723 
crescentus CB15 
Colwellia phage 9A AFK66668.1 Yes Colwellia psychrery- Gammaproteobacteria Cold No 23224375 
thraea 34H environments 
Dinoroseobacter phage ARBO06077.1 Yes Dinoroseobacter shibae Alphaproteobacteria Oceansurface Yes 28505134 
vB_DshS-R5C DFL12T 
Gordonia phage GMA2 AKJ72540.1 Yes Gordonia malaquae Actinobacteria Activated No 26241321 
A448 sludge 
Microbacterium phage AWNO03535.1 Yes Microbacterium Actinobacteria Soil No GenBank 
Hyperion foliorum NRRL 
B-24224 
Microbacterium phage AYB70129.1 Yes Microbacterium Actinobacteria Soil No GenBank 
OneinaGillian foliorum NRRL 
B-24224 
Microbacterium phage AWNO04641.1 No Microbacterium Actinobacteria Soil No GenBank 
Squash foliorum NRRL B- 
24224 
Nitrincola phage 1M3-16 AHX01069.1 Yes Nitrincola sp. 1M3-16 Gammaproteobacteria Hypersaline No GenBank 
alkaline lake 
Salinibacter virus M1EM-1 AUO78912.1 No Salinibacter ruber M1 Bacteroidetes Saltern No 29099492 
Salinibacter virus M8CR30-2 AUO79033.1 Yes Salinibacter ruber M8 Bacteroidetes Saltern No 29099492 
Salinibacter virus M8CR30-4 AUO79074.1 No Salinibacter ruber M8 Bacteroidetes Saltern No 29099492 
Streptomyces phage mu1/6 ABD94197.1 Yes Streptomyces Actinobacteria Soil No 18062183 
aureofaciens 
Environmental Halophage AFH22435.1 Yes Unknown, Unknown Saltern No 22479446 
eHP-25 hypothesized 
to be 
Nanohaloarchaea 
Uncultured Mediterranean ANSO03529.1 Yes Unknown Unknown Deep ocean No 27460793 


phage uvDeep-CGR2- 
KM21-C338 





*Whether the TerL is the one present on the tree or clustered with one of the other viral TerLs due to high sequence similarity. 


bWhether the host’s genome contains an RcGTA-like element. 
“PubMed ID of the paper that discusses the isolation and/or genome of the virus. 
4Direct submission to GenBank. 


3.4 Two amino acid changes distinguish GTA and viral 
TerLs 


Although viral TerLs that are most closely related to RCGTA-like 
TerLs have not been experimentally characterized, examination 
of amino acids that are conserved in RcGTA-like TerLs from 
large elements (group 3 on Fig. 2) but not in closely related viral 
TerLs (group 1 on Fig. 2), or vice versa, might help pinpoint the 
changes that are important for the unique packaging strategy of 
GTAs. We did not identify any amino acids that are conserved 
in group 1 TerLs but not in group 3 TerLs, but found two amino 
acids (located at positions 282 and 292 in the RcGTA TerL,; 
RefSeq record WP_031321187) that are conserved in the group 3 
TerLs but not in the group 1 TerLs (Supplementary Table S4 
and Fig. S1). In position 292, 99 per cent of the group 3 TerLs, 
but only 4 per cent of the group 1 TerLs, contain cysteine, 
whereas 59 per cent of the group 1 TerLs contain threonine. 
However, given that the threonine to cysteine substitution 
results in a reduction of the number of carbons per side chain, 


selection for the reduction in the energetic cost of GTA 
production (Kogay et al. 2020) cannot be excluded as a driver for 
this substitution. In position 282, 90 per cent of the group 3 
TerLs but no group 1 TerLs contain proline, whereas 64 per cent 
of the group 1 TerLs but only 6 per cent of the group 3 TerLs con- 
tain alanine. Proline contains two more carbons in its side 
chain than alanine, and therefore, this substitution cannot be 
selected for energetic cost savings. In TerLs from bona fide 
GTAs of R. capsulatus and Dinoroseobacter shibae, cysteine is 
found at position 292 in both proteins, whereas in position 
282 the Dinoroseobacter shibae TerL has proline and RcGTA TerL 
has alanine. 

Within the TerL protein structure, the two amino acids are 
located in the nuclease domain (Rao and Feiss 2015) and lie in 
close proximity at the opposite ends of a loop that extends to- 
ward the translocating DNA (Supplementary Fig. S2 and Figure 
3B in Zhao et al. (2013)). The importance of these residues with 
respect to the functionality of the GTA TerLs remains to be 
elucidated. 
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Table 2. Three predicted intact prophages with TerLs closely related to RCGTA-like TerLs. 








Host name TerL accession Predicted prophage coordinates Host taxonomic Host habitat Reference* 
number in the host’s genome class 
Zavarzinella formosa WP_020474221.1 NZ_JH636446.1: 39426-61037 Planctomycetia Wetlands 22740668 
DSM 19928 
Methylobacterium ter- WP_109959484.1 NZ_CP029553.1: 2863294—2900746 Alphaproteobacteria Soil 31463788 
rae 17Sr1-28 
Thermoactinomyces SDY22851.1 FNPL01000003.1:4278-49984 Bacilli Not Provided GenBank? 


sp. DSM 45892 


“PubMed ID of the paper that discusses the isolation and/or genome of the host organism. 


Direct submission to GenBank. 


4. Discussion 


RcGTA is hypothesized to package DNA via a headful mecha- 
nism because RcGTA particles encapsidate random pieces of R. 
capsulatus’ DNA, which would likely be facilitated by a non- 
sequence-specific TerL (Casjens et al. 2005; Hynes et al. 2012). 
Previous experiments support the hypothesis that RcGTA uti- 
lizes headful packaging because the packaged DNA fragments 
have different sequences at the ends (Hynes et al. 2012). The 
large dataset of available TerL sequences allowed us to obtain 
phylogenetic evidence that the RcGTA-like TerLs evolved from 
the large terminases of headful-packaging phages (Fig. 1). Our 
findings suggest that RcGTA either continues to employ the 
headful packaging strategy of its ancestor or modified it into a 
unique DNA packaging strategy. 

Previous studies have reported that the RcGTA TerL was 
most closely related either to the TerLs of T7-like viruses, which 
use a sequence-specific packaging mechanism (Casjens et al. 
2005), or to the TerLs of T4-like viruses, which employ the head- 
ful packaging mechanism (Lang and Beatty 2000; Hynes et al. 
2012). However, we found that the RcGTA-like TerLs are more 
similar to the TerLs of headful-packaging P22-like viruses. This 
discrepancy is likely due to the vastly expanded set of viral 
sequences now available in GenBank and the more sensitive 
search method that we used to identify viral TerL homologs. 

In further support of the origin of RcGTA from a virus that 
employed a headful packaging mechanism, several structural 
proteins of RcGTA have the highest sequence and secondary 
structure similarity to the corresponding proteins in viruses 
that also utilize a headful packaging strategy (Bardy et al. 2020). 
Specifically, the tail tape measure protein of RcGTA is homolo- 
gous to the tail-needle protein of bacteriophage P22, domains of 
the RcGTA hub and megatron proteins are homologous to their 
counterparts in bacteriophage T4, and the RcGTA stopper and 
tail terminator proteins are homologous to those from bacterio- 
phage SPP1 (Bardy et al. 2020). 

We identified TerLs from several viruses and predicted pro- 
phages that are phylogenetically closer relatives of the RcGTA 
TerL than P22-like TerLs. These viruses and predicted pro- 
phages either infect alphaproteobacteria or at least are found in 
the same environments as GTA-containing alphaproteobacte- 
ria. Because the specific mechanisms of headful packaging dif- 
fer among phages (Casjens et al. 1992, 2004; Bhattacharyya and 
Rao 1993), experimental characterization of packaging in vi- 
ruses that are closely related to GTAs could offer further insight 
into the origin of the GTA Terls and their packaging 
mechanism. 

The TerL phylogeny presented here supports the single ori- 
gin of the RcGTA head-tail cluster in alphaproteobacteria be- 
cause large RcGTA-like elements grouped together. With the 


newfound support for a single origin of RCGTA-like TerLs from a 
TerL of a headful-packaging phage, we propose that a headful- 
packaging TerL in the RcGTA ancestor underwent key changes 
that resulted in the switch from packaging the GTA genome to 
packaging random, small pieces of the host genome (Hynes 
et al. 2012) with a substantially lower density of DNA in the cap- 
sid (Bardy et al. 2020). The selection for reduced energy cost of 
GTA protein production that apparently occurred after the ori- 
gin of RcGTA-like elements in alphaproteobacteria (Kogay et al. 
2020) makes it challenging to pinpoint amino acid changes in 
GTA TerLs that contribute to this transition. 

The loss of specificity for the GTA genome and the reduction 
in the DNA packaging density rule out self-propagation of the 
GTA genome mediated by virions. Strikingly, the RcGTA genes 
appear to be actively precluded from packaging, being the least 
frequently packaged region in the alphaproteobacterial genome 
that is incorporated into the GTA particles (Hynes et al. 2012). 
The mechanism behind this exclusion is unknown, one possi- 
bility being that intensive expression of the RcGTA genes inter- 
feres with their packaging (Hynes et al. 2012). 

In addition to TerL, other GTA proteins that are involved in 
DNA packaging, including TerS and portal, might contribute to 
the unique DNA packaging features of RcGTA. In particular, 
TerS proteins determine where packaging initiates and control 
the specificity of packaging through the recognition of pac sites 
(Leavitt et al. 2013). However, no discrete packaging start sites 
were found in the RcGTA genome (Hynes et al. 2012). The 
RcGTA gene gi, which is adjacent to the terL gene (g2) in the 
RcGTA genome, has been recently shown to encode TerS 
(Sherlock, Leong, and Fogg 2019). Sherlock, Leong, and Fogg 
(2019) demonstrated that the RcGTA TerS binds non-specifically 
to DNA with low affinity due to the absence of a specific DNA- 
binding domain and the retention of non-specific DNA binding 
activity. A TerS protein with altered DNA-binding characteris- 
tics and a modified headful TerL might together underlie the 
random packaging that is characteristic of RCGTA. 
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